Abstract. Increasingly, business projects are ephemeral. New Business Intelligence
tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a
popular community-driven visualization technique. Hence, we investigate tag-cloud views
with support for OLAP operations such as roll-ups, slices, dices, clustering, and
drill-downs. As a case study, we implemented an application where users can upload data
and immediately navigate through its ad hoc dimensions. To support social networking,
views can be easily shared and embedded in other Web sites. Algorithmically, our
tag-cloud views are approximate range top-k queries over spontaneous data cubes. We
present experimental evidence that iceberg cuboids provide adequate online
approximations. We benchmark several browser-oblivious tag-cloud layout optimizations.

1 Introduction

The Web 2.0, or Social Web, is about making social software applications available on
the Web in an unrestricted manner. Enabling a wide range of distributed individuals to
collaborate on data analysis tasks may lead to significant productivity gains [1, 2].
Several companies, like SocialText and IBM, are offering Web 2.0 solutions dedicated to
enterprise needs. The data visualization Web sites Many Eyes [3] and Swivel [4] have
become part of the Web 2.0 landscape: over 1 million data sets were uploaded to Swivel
in less than 3 months [5].

These Web 2.0 data visualization sites use traditional pie charts and histograms, but 
also tag clouds. Tag clouds are a form of histogram which can represent the amplitude 
of over a hundred items by varying the font size. The use of hyperlinks makes tag 
clouds naturally interactive. Tag clouds are used by many Web 2.0 sites such as Flickr, 
del.icio.us and Technorati. Increasingly, e-Commerce sites such as Amazon or O'Reilly
Media are using tag clouds to help their users navigate through aggregated data.

Meanwhile, OLAP (On-Line Analytical Processing) [6] is a dominant paradigm in Business
Intelligence (BI). OLAP allows domain experts to navigate through aggregated data in a
multidimensional data model. Standard operations include drill-down, roll-up, dice, and
slice. The data cube [7] model provides well-defined semantics and performance
optimization strategies. However, OLAP requires much effort from database administrators
even after the data has been cleaned, tuned and loaded: schemas must be designed in
collaboration with users having fast-changing needs and requirements [8, 9]. Vendors
such as Spotfire, Business Objects and QlikTech have reacted by proposing a new class
of tools allowing end users to customize their applications and to limit the need for
centralized schema crafting [10].

OLAP itself has never been formally defined, though rules have been proposed to
recognize an OLAP application [6]. In a similar manner, we propose rules to recognize
Web 2.0 OLAP applications (see also Table 1):

1. Data and schemas are provided autonomously by users. 

2. It is available as a Web application. 

3. It supports complete online interaction over aggregated multidimensional data. 

4. Users are encouraged to collaborate. 

Tag clouds are well suited for Web 2.0 OLAP. They are flexible: a tag cloud can
represent a dozen or a hundred different amplitudes. And they are accessible: the only
requirement is a browser that can display different font sizes. They also spark
discussion [11].

We describe a tag-cloud formalism, as an instance of Web 2.0 OLAP. Since we implemented
a prototype, technical issues will be discussed regarding application design. In
particular, we used iceberg cubes [12] to generate tag clouds online when the data and
schema are provided extemporaneously. Because tag clouds are meant to convey a general
impression, presenting approximate measures and clustering is sufficient: we propose
specific metrics to measure the quality of tag-cloud approximations. We conclude the
paper with experimental results on real and synthetic data sets.

Table 1. Conventional OLAP versus Web 2.0 OLAP

Conventional OLAP     Web 2.0 OLAP
------------------    ---------------------
recurring needs       ephemeral projects
predefined schemas    spontaneous schemas
centralized design    user initiative
histograms            tag clouds
plots and reports     iframes, wikis, blogs
access control        social networking



2 Related Work 

There are decentralized models [13] and systems [14] to support collaborative data
sharing without a single schema.

According to Wu et al., it is difficult to navigate an OLAP schema without help; they
have proposed a keyword-driven OLAP model [15]. There are several OLAP visualization
techniques, including the Cube Presentation Model (CPM) [16], Multiple Correspondence
Analysis (MCA) [17], and other interactive systems [18].

Tag clouds have been popularized by the Web site Flickr, launched in 2004. Several
optimization opportunities exist: similar tags can be clustered together [19], tags can
be pruned automatically [20] or by user intervention [21], tags can be indexed [21],
and so on. Tag clouds can be adapted to spatio-temporal data [22, 23].

3 OLAP Formalism

3.1 Conventional OLAP Formalism 



Most OLAP engines rely on a data cube [7]. A data cube $C$ contains a non-empty set of
$d$ dimensions $\mathcal{D} = \{D_i\}_{1 \le i \le d}$ and a non-empty set of measures
$\mathcal{M}$. Data cubes are usually derived from a fact table (see Table 2) where each
dimension and measure is a column and all rows (or facts) have disjoint dimension
tuples. Figure 1(a) gives a tridimensional representation of the data cube.



Table 2. Fact table example

Dimensions                                   Measures
location    time       salesman   product    cost    profit
--------    --------   --------   -------    ----    ------
Montreal    March      John       shoe       100$    10$
Montreal    December   Smith      shoe       150$    30$
Quebec      December   Smith      dress      175$    45$
Ontario     April      Kate       dress      90$     10$
Paris       March      John       shoe       100$    20$
Paris       March      Marc       table      120$    10$
Paris       June       Martin     shoe       120$    5$
Lyon        April      Claude     dress      90$     10$
New York    October    Joe        chair      100$    10$
New York    May        Joe        chair      90$     10$
Detroit     April      Jim        dress      90$     10$




[Figure 1 panels: (a) OLAP data cube; (b) tag-cloud data cube; (c) OLAP roll-up on
product; (d) tag-cloud roll-up; (e) OLAP dice on the first semester of the year;
(f) tag-cloud dice; (g) OLAP slice on one product; (h) tag-cloud slice]

Fig. 1. Conventional OLAP operations vs. tag-cloud OLAP operations

Measures can be aggregated using several operators such as AVERAGE, MAX, MIN, SUM, and
COUNT. All of these measures and dimensions are typically prespecified in a database
schema. Database administrators preaggregate views to accelerate queries.

The data cube supports the following operations: 

- A slice specifies that you are only interested in some attribute values of a given
  dimension. For example, one may want to focus on one specific product (see
  Figure 1(g)). Similarly, a dice selects ranges of attribute values (see Figure 1(e)).
- A roll-up aggregates the measures on coarser attribute values. For example, from the
  sales given for every store, a user may want to see the sales aggregated per country
  (see Figure 1(c)). A drill-down is the reverse operation: from the sales per country,
  one may want to explore the sales per store in one country.

The various specific multidimensional views in Figure 1 are called cuboids.
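
As an illustration only, the slice, dice, and roll-up operations can be sketched in a few lines of Python over the fact table of Table 2. The function names, the tuple layout, and the choice of SUM as the aggregate are assumptions made for this sketch; they do not describe how an actual OLAP engine or our prototype is implemented.

```python
from collections import defaultdict

# Rows of Table 2: (location, time, salesman, product, cost, profit).
COLUMNS = ("location", "time", "salesman", "product", "cost", "profit")
FACTS = [
    ("Montreal", "March", "John", "shoe", 100, 10),
    ("Montreal", "December", "Smith", "shoe", 150, 30),
    ("Quebec", "December", "Smith", "dress", 175, 45),
    ("Ontario", "April", "Kate", "dress", 90, 10),
    ("Paris", "March", "John", "shoe", 100, 20),
    ("Paris", "March", "Marc", "table", 120, 10),
    ("Paris", "June", "Martin", "shoe", 120, 5),
    ("Lyon", "April", "Claude", "dress", 90, 10),
    ("New York", "October", "Joe", "chair", 100, 10),
    ("New York", "May", "Joe", "chair", 90, 10),
    ("Detroit", "April", "Jim", "dress", 90, 10),
]


def dice(facts, dimension, allowed):
    """Keep the facts whose value on `dimension` belongs to the set `allowed`."""
    i = COLUMNS.index(dimension)
    return [row for row in facts if row[i] in allowed]


def slice_(facts, dimension, value):
    """A slice is treated here as a dice restricted to a single attribute value."""
    return dice(facts, dimension, {value})


def roll_up(facts, dimensions, measure):
    """Aggregate `measure` with SUM over the retained `dimensions` (a cuboid)."""
    idx = [COLUMNS.index(d) for d in dimensions]
    m = COLUMNS.index(measure)
    cuboid = defaultdict(int)
    for row in facts:
        cuboid[tuple(row[i] for i in idx)] += row[m]
    return dict(cuboid)


# Rolling up on product leaves a (location, time) cuboid.
by_location_time = roll_up(FACTS, ("location", "time"), "cost")
shoes_only = slice_(FACTS, "product", "shoe")
first_semester = dice(FACTS, "time", {"January", "February", "March",
                                      "April", "May", "June"})
```

A drill-down would simply call `roll_up` again with a finer list of dimensions, since this sketch always recomputes from the facts rather than from a preaggregated view.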

3.2 Tag-Cloud OLAP Formalism 

A Web 2.0 OLAP application should be supported by a flexible formalism that can adapt
to a wide range of data loaded by users. Processing time must be reasonable and batch
processing should be avoided.

Unlike in conventional data cubes, we do not expect that most dimensions have explicit
hierarchies when they are loaded: instead, users can specify how the data is laid out
(see Section 5). As a related issue, the dimensions are not orthogonal in general: there
might be a "City" dimension as well as a "Climate Zone" dimension. It is up to the user
to organize the cities per climate zone or per country.

Definition 1 (Tag). A tag is a term or phrase describing an object with a corresponding
non-negative weight determining its relative importance. Hence, a tag is made of a
triplet (term, object, weight).

As an example, a picture may have been attributed the tags "dog" (12 times) and 
"cat" (20 times). In a Business Intelligence context, a tag may describe the current state 
of a business. For example, the tags "USA" (16,000$) and "Canada" (8,000$) describe 
the sales of a given product by a given salesman. 

We can aggregate several attribute values, such as "Canada" and "March," into a single
term, such as "Canada-March." A tag composed of k attribute values is called a k-tag.
Figure 1(b) shows a tag-cloud representation of Table 2 using 3-tags.



Each tag T is represented visually using a font size, font color, background color, 
area or motif, depending on its measure values. 
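
The construction of k-tags and their visual encoding can be sketched as follows. This is a minimal illustration, not the prototype's code: the dictionary-based facts, the sample sales figures, the SUM aggregate, and the linear mapping from weight to font size (10 to 36 pixels) are all assumptions. The object described by the tags is left implicit.

```python
from collections import defaultdict

# Hypothetical facts: two USA sales and one Canada sale in March.
SAMPLE_FACTS = [
    {"country": "USA", "month": "March", "sales": 9000},
    {"country": "USA", "month": "March", "sales": 7000},
    {"country": "Canada", "month": "March", "sales": 8000},
]


def k_tags(facts, tag_columns, measure_column):
    """Build k-tags: join k attribute values into one term and sum the measure."""
    weights = defaultdict(float)
    for fact in facts:
        term = "-".join(str(fact[c]) for c in tag_columns)
        weights[term] += fact[measure_column]
    return dict(weights)


def font_sizes(weights, smallest=10, largest=36):
    """Map each weight linearly to a font size between `smallest` and `largest`."""
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1.0  # all weights equal: every tag gets the smallest size
    return {
        term: round(smallest + (w - lo) * (largest - smallest) / span)
        for term, w in weights.items()
    }


two_tags = k_tags(SAMPLE_FACTS, ("country", "month"), "sales")
sizes = font_sizes(two_tags)
```

Other encodings mentioned above, such as font color or background color, could replace the font size by swapping the second function.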

3.3 Tag-Cloud Operations 

In our system, users can upload data, select a data set, and define a schema by
choosing dimensions (see Figure 2). Then, users can apply various operations on the data
using a menu bar. On the one hand, OLAP operations such as slice, dice, roll-up and
drill-down generate new tag clouds and new cuboids from existing cuboids.
Figures 1(d), 1(f) and 1(h) show the results of a roll-up, a dice, and a slice as tag
clouds. On the other hand, we can apply some operations on an existing tag cloud: sort
by either the weights or the terms of tags, remove some tags, remove lesser weighted
tags, and so on. We estimate that a tag cloud should not have more than 150 tags.



[Figure 2: screenshot of the schema-design form, with a list of stored data cubes, a
list of dimensions (Number, Gender, GivenName, MiddleInitial, Surname, StreetAddress,
City, State, ZipCode, Country, EmailAddress, TelephoneNumber, MothersMaiden, Birthday,
CCType), and a menu of tag-cloud operations (Project, Roll-Up, Slice, Dice, Drill-Down)]

Fig. 2. User-driven schema design





Tag-cloud layout has measurable benefits when trying to convey a general impression
[24]. Hence, we wish to optimize the visual arrangement of tags. Chen et al. propose the
computation of similarity measures between cuboids to help users explore data [25]: we
apply this idea to define similarities between tags. First of all, users are asked to
provide one or several dimensions they want to use to cluster the tags. Choosing the
"Country" dimension would mean that the user wants the tags rearranged by countries so
that "Montreal-April" and "Toronto-March" are nearby (see Figure 3). The clustering
dimensions selected by the user together with the tag-cloud dimensions form a cuboid:
in our example, we have the dimensions "Country," "City," and "Time." Since a tag
contains a set of attribute values, it has a corresponding subcuboid defined by slicing
the cuboid.



[Figure 3: screenshot of the form for selecting tag-support and similarity dimensions.
At least one tag-support dimension must be selected. In the example, the cuboid's
dimensions "location" and "time" are the tag-support dimensions, whose attribute values
are combined to derive a tag cloud, and "country" is the clustering dimension, used to
cluster the tags.]

Fig. 3. Choosing similarity dimensions

Several similarity measures can be applied between subcuboids: Jaccard, Euclidean
distance, cosine similarity, Tanimoto similarity, Pearson correlation, Hamming distance,
and so on. Which similarity measure is best depends on the application at hand, so
advanced users should be given a choice. Commonly, similarity measures take values in
the interval [-1, 1]. Similarity measures are expected to be reflexive (f(a, a) = 1),
symmetric (f(a, b) = f(b, a)) and transitive: if a is similar to b, and b is similar to
c, then a is also similar to c.

Recall that given two vectors $v$ and $w$, the cosine similarity measure is defined as
$\cos(v, w) = \sum_i v_i w_i / \sqrt{\sum_i v_i^2 \sum_i w_i^2} = v/|v| \cdot w/|w|$.
The Tanimoto similarity is given by
$\sum_i v_i w_i / (\sum_i v_i^2 + \sum_i w_i^2 - \sum_i v_i w_i)$; it becomes the Jaccard
similarity when the vectors have binary values. Both of these measures are reflexive,
symmetric and transitive. Specifically, the cosine similarity is transitive by this
inequality: $\cos(v, z) \ge \cos(w, z) - \sqrt{1 - \cos(v, w)^2}$. To generalize the
formulas from vectors to cuboids, it suffices to replace the single summation by one
summation per dimension. Figure 4 shows an example of tag-cloud reordering to cluster
similar tags. In this example, the "City-Product" tags were compared according to the
"Country" dimension. The result is that the tags are clustered by countries.
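
The two similarity measures can be written directly from their definitions. The sketch below works on plain vectors; the example vectors, which spread a tag's measure over a hypothetical (Canada, France, USA) clustering dimension, and the convention of returning 0 for all-zero vectors are assumptions of this illustration.

```python
import math


def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    return dot / norm if norm else 0.0


def tanimoto(v, w):
    """Tanimoto similarity; equals the Jaccard similarity on binary vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    denom = sum(a * a for a in v) + sum(b * b for b in w) - dot
    return dot / denom if denom else 0.0


# Hypothetical subcuboids along a "Country" dimension (Canada, France, USA).
montreal_shoe = [250, 0, 0]
toronto_dress = [90, 0, 0]
paris_table = [0, 120, 0]

same_country = cosine(montreal_shoe, toronto_dress)
other_country = cosine(montreal_shoe, paris_table)
```

With such vectors, two tags from the same country get the maximal similarity and two tags from different countries get a similarity of zero, which is what makes the clustering by country possible.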



Without similarity:
Detroit-dress, Quebec-dress, Paris-table, Toronto-dress, Paris-shoe, Montreal-shoe,
Lyon-dress, New York-chair

With similarity:
Detroit-dress, New York-chair, Quebec-dress, Toronto-dress, Montreal-shoe, Paris-table,
Paris-shoe, Lyon-dress

Fig. 4. Tag-cloud reordering based on similarity



4 Fast Computation 

Because only a moderate number of tags can be displayed, the computation of tag clouds
is a form of top-k query: given any user-specified range of cells, we seek the top-k
cells having the largest measures. There is little hope of answering such queries in
near constant time with respect to the number of facts without an index or a buffer.
Indeed, finding all and only the elements with frequency exceeding a given frequency
threshold [26] or merely finding the most frequent element [27] requires $\Omega(m)$
bits where m is the number of distinct items.

Various efficient techniques have been proposed for the related range MAX problem
[28, 29], but they do not necessarily generalize. Instead, for the range top-k problem,
we can partition sparse data cubes into customized data structures to speed up queries
by an order of magnitude [30, 31, 32]. We can also answer range top-k queries using
RD-trees [33] or R-trees [34]. In tag clouds, precision is not required and accuracy is
less important; only the most significant tags are typically needed. Further, if all
tags have similar weights, then any subset of tags may form an acceptable tag cloud.

A strategy to speed up top-k queries is to transform them into comparatively easier
iceberg queries [12]. For example, in computing the top-10 (k = 10) best vendors, one
could start by finding all vendors with a rating above 4/5. If there are at least 10
such vendors, then sorting this smaller list is enough. If not, one can restart the
query, seeking vendors with a rating above 3/5. Given a histogram or selectivity
estimates, we can reduce the number of expected iceberg queries [35]. Unfortunately,
this approach is not necessarily applicable to multidimensional data since even
computing iceberg aggregates once for each query may be prohibitive. However, iceberg
cuboids can still be put to good use. That is, one materializes the iceberg of a cuboid,
small enough to fit in main memory, from which the tag clouds are computed. Intuitively,
a cuboid representing the largest measures is likely to provide reasonable tag clouds.
Users mostly notice tags with large font sizes [24]. A good approximation captures the
tags having significantly larger weights. To determine whether a tag cloud has such
significant tags, we can compute the entropy.
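
Both ideas, answering a top-k query through a sequence of iceberg queries and materializing a small iceberg cuboid once, can be sketched in Python. The in-memory dictionaries, the function names, the vendor ratings, and the threshold list are hypothetical; a real system would run the iceberg queries against a database rather than a dictionary.

```python
import heapq


def iceberg(cells, threshold):
    """Iceberg query: keep only the cells whose measure is above the threshold."""
    return {key: m for key, m in cells.items() if m > threshold}


def top_k_by_iceberg(cells, k, thresholds):
    """Try decreasing thresholds until at least k cells survive, then sort them."""
    for threshold in sorted(thresholds, reverse=True):
        survivors = iceberg(cells, threshold)
        if len(survivors) >= k:
            break
    else:
        survivors = dict(cells)  # no threshold was low enough: use every cell
    ranked = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


def iceberg_cuboid(cells, limit):
    """Materialize the `limit` cells with the largest measures."""
    return dict(heapq.nlargest(limit, cells.items(), key=lambda kv: kv[1]))


# Hypothetical vendor ratings out of 5.
ratings = {"a": 4.5, "b": 4.1, "c": 3.9, "d": 3.2, "e": 2.0}
best_two = top_k_by_iceberg(ratings, 2, [4, 3])    # one iceberg query suffices
best_three = top_k_by_iceberg(ratings, 3, [4, 3])  # the query is restarted at 3
small_cuboid = iceberg_cuboid(ratings, 2)
```

Tag clouds computed from `small_cuboid` are approximate: any tag whose cells were all discarded is missing, which is why the quality of the approximation needs to be measured.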

Definition 2 (Entropy of a tag cloud). Let $T \in \mathcal{T}$ be a tag from a tag
cloud $\mathcal{T}$, then
$\mathrm{entropy}(\mathcal{T}) = -\sum_{T \in \mathcal{T}} p(T) \log(p(T))$ where
$p(T) = \mathrm{weight}(T) / \sum_{x \in \mathcal{T}} \mathrm{weight}(x)$.

The entropy quantifies the disparity of weights between tags. The lower the entropy,
the more interesting the corresponding tag cloud is. Indeed, tag clouds with uniform tag
weights have maximal entropy and are visually not very informative (see Figure 5).
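
A short sketch of the entropy computation follows. The natural logarithm, the handling of zero weights, and the helper that divides by log(tag-cloud size) are assumptions of this illustration; the weights are assumed to have a positive sum.

```python
import math


def entropy(weights):
    """Entropy of a tag cloud given the list of its tag weights."""
    total = sum(weights)
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)


def relative_entropy(weights):
    """Entropy divided by log(tag-cloud size), so that uniform clouds score 1."""
    n = len(weights)
    return entropy(weights) / math.log(n) if n > 1 else 0.0


uniform = [5, 5, 5, 5]   # all tags have the same size: maximal entropy
skewed = [100, 1, 1, 1]  # one dominant tag: low entropy
```

Dividing by log(tag-cloud size) makes clouds of different sizes comparable, since the maximal entropy of a cloud with n tags is log(n).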

[Figure 5: screenshot of a tag cloud over the "City" dimension in which all tags have
about the same size: Salt Lake City, Baltimore, Saint Louis, Irvine, Louisville,
Richmond, Tampa, Milwaukee, Orlando, Birmingham, Jacksonville, Oklahoma City, Kansas
City, Minneapolis, Austin, Grand Rapids, Cambridge, Pittsburgh, Springfield, Columbus,
Oakland, Indianapolis, Denver, Miami, Nashville, Newark, Jackson, San Antonio. Menu
bar: Download as XML, Permanent link, Sort, Iceberg, TopN, Strip tags, Undo. Time
elapsed: 125 ms, aggregating over {City}, showing {City}.]

Fig. 5. Example of non informative tag cloud



We can measure the quality of a low-entropy tag cloud by measuring false positives and
false negatives: a false positive occurs when a tag has been falsely added to a tag
cloud, whereas a false negative occurs when a tag is missing. These measures of error
assume that we limit the number of tags to a moderately small number. We use the
following quality indexes; index values are in [0, 1] and a value of 0 is ideal; they
are not applicable to high-entropy tag clouds.

Definition 3. Given approximate and exact tag clouds $A$ and $E$, the false-positive
and false-negative indexes are
$\frac{\max_{t \in A, t \notin E} \mathrm{weight}(t)}{\max_{t \in A} \mathrm{weight}(t)}$
and
$\frac{\max_{t \in E, t \notin A} \mathrm{weight}(t)}{\max_{t \in E} \mathrm{weight}(t)}$.
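
The sketch below reads Definition 3 as follows: the false-positive index is the largest weight among the tags wrongly present in the approximate cloud, divided by the largest weight in the approximate cloud; the false-negative index is the largest weight among the tags missing from the approximate cloud, divided by the largest weight in the exact cloud. This reading, the dictionary representation, and the sample weights are assumptions.

```python
def false_positive_index(approx, exact):
    """Largest weight of a tag in `approx` but not in `exact`, normalized."""
    extra = [w for t, w in approx.items() if t not in exact]
    return max(extra, default=0.0) / max(approx.values())


def false_negative_index(approx, exact):
    """Largest weight of a tag in `exact` but not in `approx`, normalized."""
    missing = [w for t, w in exact.items() if t not in approx]
    return max(missing, default=0.0) / max(exact.values())


# Hypothetical tag clouds mapping each tag to its weight.
exact_cloud = {"a": 10, "b": 8, "c": 6}
approx_cloud = {"a": 10, "b": 8, "d": 5}

fp = false_positive_index(approx_cloud, exact_cloud)
fn = false_negative_index(approx_cloud, exact_cloud)
```

Both indexes are 0 when the two clouds contain the same tags, and they grow with the visual prominence of the wrongly added or missing tag.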



5 Tag-Cloud Drawing 

While we can ensure some level of device-independent display on the Web by using images
or plugins, text display in HTML may vary substantially from one browser to another.
There is no common set of fonts that browsers are required to support, and Web standards
do not dictate line-breaking algorithms or other typographical issues. It is not
practical to simulate the browser on a server. Meanwhile, if we wish to remain
accessible and to abide by open standards, producing HTML and ECMAScript is the favorite
option.

Given tag-cloud data, the tag-cloud drawing problem is to optimally display the 
tags, generally using HTML, so that some desirable properties are met, including the 
following: (1) the screen space usage is minimized; (2) when applicable, similar tags 
are clustered together. Typically, the width of the tag cloud is fixed, but its height can 
vary. 

For practical reasons, we do not wish for the server to send all of the data to the
browser, including a possibly large number of similarity measures between tags. Hence,
some of the tag-cloud drawing computations must be server-bound. There are two possible
architectures. The first scenario is a browser-aware approach [19]: given the tag-cloud
data provided by the server, the browser sends back to the server some display-specific
data, such as the box dimensions of various tags using different font sizes. The server
then sends back an optimized tag cloud. The second approach is browser-oblivious: the
server optimizes the display of the tag cloud without any knowledge of the browser by
passing simple display hints. The browser can then execute a final and inexpensive
display optimization. While browser-oblivious optimization is necessarily limited, it
has reduced latency and it is easily cacheable.

Browser-oblivious optimization can take many forms. For example, we could send classes
of tags and instruct the browser to display them on separate lines [20]. In our system,
tags are sent to the browser as an ordered list, using the convention that successive
tags are similar and should appear nearby. Given a similarity measure $w$ between tags,
we want to minimize $\sum_{p,q} w(p, q) d(p, q)$ where $d(p, q)$ is a distance function
between the two tags in the list and the sum is over all tags. Ideally, $d(p, q)$ should
be the physical distance between the tags as they appear in the browser; we model this
distance with the index distance: if tag $a$ appears at index $i$ in the list and tag
$b$ appears at index $j$, their distance is the integer $|i - j|$. This optimization
problem is an instance of the NP-complete minimum linear arrangement (MLA) problem: an
optimal linear arrangement of a graph $G = (V, E)$ is a map $f$ from $V$ onto
$\{1, 2, \ldots, N\}$ minimizing $\sum_{(u,v) \in E} |f(u) - f(v)|$.

Proposition 1. The browser-oblivious tag-cloud optimization problem is NP-complete.

There is an $O(\sqrt{\log n} \log \log n)$-approximation for the MLA problem [36] in
some instances. However, for our generic purposes, the greedy NEAREST NEIGHBOR (NN)
algorithm might suffice: insert any tag in an empty list, then repeatedly append a tag
most similar to the latest tag in the list, until all tags have been inserted. It runs
in $O(n^2)$ time where n is the number of tags. Another heuristic for the MLA problem is
the PAIRWISE EXCHANGE MONTE CARLO (PWMC) method [37]: after applying NN, you repeatedly
consider the exchange of two tags chosen at random, permuting them if it reduces the MLA
cost. Another MONTE CARLO (MC) heuristic begins with the application of NN [38]: cut the
list into two blocks at a random location, test if exchanging the two blocks reduces the
MLA cost, if so proceed; repeat.
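
The NN and PWMC heuristics can be sketched as follows, together with the MLA cost they try to reduce. The tags are those of Figure 4; the city-to-country mapping, the 0/1 similarity, the fixed random seed, and the full recomputation of the cost after each exchange are assumptions made to keep the sketch short. The cost sums over unordered pairs, which is half the sum over ordered pairs and has the same minimizers.

```python
import random

# Assumed countries for the cities appearing in the tags of Figure 4.
COUNTRY = {"Detroit": "USA", "New York": "USA", "Quebec": "Canada",
           "Toronto": "Canada", "Montreal": "Canada",
           "Paris": "France", "Lyon": "France"}
TAGS = ["Detroit-dress", "Quebec-dress", "Paris-table", "Toronto-dress",
        "Paris-shoe", "Montreal-shoe", "Lyon-dress", "New York-chair"]


def same_country(p, q):
    """Toy similarity: 1 when both city-product tags share a country, else 0."""
    city_p, city_q = p.rsplit("-", 1)[0], q.rsplit("-", 1)[0]
    return 1 if COUNTRY[city_p] == COUNTRY[city_q] else 0


def mla_cost(order, sim):
    """Sum of sim(p, q) * |i - j| over all unordered pairs of tags in the list."""
    n = len(order)
    return sum(sim(order[i], order[j]) * (j - i)
               for i in range(n) for j in range(i + 1, n))


def nearest_neighbor(tags, sim):
    """NN: start with the first tag, then append a tag most similar to the last."""
    remaining = list(tags)
    order = [remaining.pop(0)]
    while remaining:
        best = max(remaining, key=lambda t: sim(order[-1], t))
        remaining.remove(best)
        order.append(best)
    return order


def pairwise_exchange(order, sim, exchanges, rng=None):
    """PWMC: try random swaps of two tags, keeping a swap only if the cost drops."""
    rng = rng or random.Random(0)
    order = list(order)
    if len(order) < 2:
        return order
    cost = mla_cost(order, sim)
    for _ in range(exchanges):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = mla_cost(order, sim)
        if new_cost < cost:
            cost = new_cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the exchange
    return order


nn_order = nearest_neighbor(TAGS, same_country)
pwmc_order = pairwise_exchange(nn_order, same_country, 100)
```

Under these assumptions, NN groups the tags by country. Because PWMC only accepts exchanges that lower the cost, its result is never worse than the NN list it starts from, at the price of one cost evaluation per exchange.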

Additional display hints can be inserted in this list. For example, if two tags must
absolutely be very close to each other, a glued token could be inserted. Also, if two
tags can be permuted freely in the list, then a permutable token could be inserted: the
list could take the form of a PQ tree [39].



6 Experiments 

Throughout these experiments, we used Java version 1.6.0_02 from Sun Microsystems Inc.
on an Apple MacPro machine with 2 Dual-Core Intel Xeon processors running at 2.66 GHz
and 2 GiB of RAM.

6.1 Iceberg-Based Computation 

To validate the generation of tag clouds from icebergs, we have run tests over the US
Income 2000 data set [40] (42 dimensions and about $2 \times 10^5$ facts) as well as a
synthetic data set (18 dimensions and $2 \times 10^4$ facts) provided by Swivel
(http://www.swivel.com/data_sets/show/1002247). Figure 6 shows that while some tag-cloud
computations require several minutes, iceberg-based computations can be much faster.

[Figure 6: plot with a logarithmic vertical axis ranging from 0.01 to 1000 and a
horizontal axis "# of dimensions" ranging from 3 to 11]

Fig. 6. Computing tag clouds from original data vs. icebergs: iceberg limit value set at
150 and tag-cloud size is 9 (US Income 2000).

From each data set, we generated a 4-dimensional data cube. We used the COUNT function
to aggregate data. Tag clouds were computed from each data cube using the iceberg
approximation with different values of limit: the number of facts retained. We also
implemented exact computations using temporary tables. We specified different values
for tag-cloud size, limiting the maximum number of tags. For each iceberg limit value
and tag-cloud size, we computed the entropy of the tag cloud, the false-positive and
false-negative indexes, and the processing time for both the iceberg approximation and
the exact computation.

We plotted in Figure 7 the false-positive and false-negative indexes as a function of
the relative entropy (entropy/log(tag-cloud size)) using various iceberg limit values
(150, 600, 1200, 4800, and 19600) and various tag-cloud sizes (50, 100, 150, and 200),
for a total of 20 tag clouds per dimension. The Y axis is in a logarithmic scale. Points
having their indexes equal to zero are not displayed. As discussed in Section 4,
false-positive and false-negative indexes should be low when the entropy is low. We
verify that for low-entropy values (< | log (tag-cloud size)), the indexes are always
close to zero, which indicates a good approximation. Meanwhile, small iceberg cuboids
can be processed much faster.



[Figure 7: false-positive and false-negative indexes on a logarithmic vertical axis
versus entropy/log(tag-cloud size) on the horizontal axis (ticks at 0.2, 0.4, 0.6 and
0.8). (a) Swivel, with series State (52), MiddleInitial (26), Surname (7270) and
City (4102); (b) US Income 2000, with series Country of birth (43), Age (91), Capital
losses (1478) and a "Household" dimension whose cardinality is cut off.]

Fig. 7. False-negative and false-positive indexes (0 is best, 1 is worst); values under
0.0001 are not included



[Figure 8: MLA costs for No Clustering, NN, PWMC10, PWMC100 and PWMC1000.
(a) Displaying dimension "Givenname" and clustering by "State" (Swivel);
(b) displaying dimension "HHDFMX" and clustering by "ARACE" (US Income 2000).]

Fig. 8. MLA costs for two examples: the PWMC heuristic was applied using 10, 100 and
1000 random exchanges.



Experimentally, we found that the entropy is not sensitive to the iceberg limit, but it
grows with the tag-cloud size (see Figures 9(a) and 10(a)). Naturally, the tag-cloud
size is bounded by the cardinality of the chosen dimension.

Fig. 9. Benchmarking iceberg computation over Swivel

(b) Processing time



Fig. 10. Benchmarking iceberg computation over US Income 2000 



We computed the relative gain in processing time due to the iceberg limit as
$(t - t')/t$ where $t$ is the time required for the exact computation whereas $t'$ is
the time used by an iceberg computation (see Figures 9(b) and 10(b)). For these tests,
the tag-cloud size was set to 150. Generally, the lower the iceberg limit value, the
better the gain. High-cardinality dimensions benefit less from a small iceberg limit.
Also, the ratios of false positives and false negatives decrease as the iceberg limit
increases. However, for low-cardinality dimensions, these ratios are often close to
zero, so only high-cardinality dimensions benefit from higher iceberg limits. Hence, you
should choose a small or large iceberg limit depending on whether you have a low- or
high-cardinality dimension.



6.2 Similarity Computation 

Using our two data sets, we tested the NN, PWMC, and MC heuristics using both the cosine
and the Tanimoto similarity measures. From data cubes made of all available dimensions,
we used all possible 1-tag clouds, using successively all other dimensions as clustering
dimension for a total of 2 × (18 × 17 + 42 × 41) = 4056 layout optimizations. The
iceberg limit value was set at 150. The MC heuristic never fared better than NN, even
when considering a very large number of random block permutations: we rejected this
heuristic as ineffective. However, as Figure 8 shows, the PWMC heuristic can sometimes
significantly outperform NN when a large number (1000) of tag exchanges are considered,
but it only outperforms NN by more than 20% in less than 5% of all layout optimizations.
Meanwhile, Table 3 shows that if our objective is to reduce the MLA cost by 90%, all
heuristics are equivalent. However, it also shows that PWMC can be one to two orders of
magnitude slower than NN: NN is 10 times faster than PWMC with 100 exchanges and 70
times faster than PWMC with 1000 exchanges. Computing the similarity function over an
iceberg cuboid was moderately expensive (0.07 s) for a small iceberg cuboid (limit set
to 150 cells): the exact computation of the similarity function can dwarf the cost of
the heuristics (NN and PWMC) over a moderately large data set. Informal tests suggest
that NN computed over a small iceberg cuboid provides significant visual layouts.

Table 3. Comparison of various MLA heuristics over the Swivel data set using the cosine
similarity measure (306 tag clouds). The running time is the average of 100
optimizations for tag clouds of size 150. The number of tag clouds (out of 306) having
at least a given gain is given.

                   NN       PWMC
                            10      100     1000
--------------     -----    ----    ----    ----
time (s)           0.003    0.01    0.03    0.2
MLA gain > 0%      154      154     154     154
MLA gain > 30%     143      143     145     148
MLA gain > 70%     112      112     112     116
MLA gain > 90%     97       97      97      99



7 Conclusion 

According to our experimental results, precomputing a single iceberg cuboid per data
cube allows us to generate adequate approximate tag clouds online. Combined with modern
Web technologies such as AJAX and JSON, it provides a responsive application. However,
we plan to make more precise the relationship between iceberg cubes, entropy, dimension
sizes, and our quality indexes. Yet another approach to compute tag clouds quickly may
be to use a bitmap index [41]. While we built a Web 2.0 application with support for
numerous collaboration features such as permalinks and tag-cloud embeddings with iframe
elements, we still need to experiment with live users. Our approach to multidimensional
tag clouds has been to rely on k-tags. However, this approach might not be appropriate
when a dimension has a linear flow such as time or latitude. A more appropriate approach
is to allow the use of a slider [22] tying several tag clouds, each one corresponding to
a given attribute value.